The potential use of Automatic Speech Recognition to assist receptive communication is explored. 
The opportunities and challenges that this technology presents to students and staff are also 
discussed and evaluated: providing captioning of speech online or in classrooms for deaf or hard of 
hearing students, and assisting blind, visually impaired or dyslexic learners to read and search 
learning material more readily by augmenting synthetic speech with natural recorded real speech. 
The automatic provision of online lecture notes, synchronised with speech, enables staff and 
students to focus on learning and teaching issues, while also benefiting learners who are unable to 
attend the lecture or who find it difficult or impossible to take notes at the same time as 
listening, watching and thinking. 


Introduction 

Students in the United Kingdom who find it difficult or impossible to write using a 
keyboard may use Automatic Speech Recognition (ASR) to assist or enable their 
written expressive communication (Banes & Seale, 2002; Draffan, 2002; Hargrave- 
Wright, 2002). In a report to the English Higher Education Funding Council it was 
noted that one of the ‘key issues for teaching’ with regard to information and commu- 
nications technology was the opportunities for such technologies as speech recogni- 
tion software: 

The importance of this development is that it will change the nature of interaction with 
computers. Word commands will make it easier to operate a computer, particularly for 
people with relatively low literacy skills. This, in turn, will have major implications for the 
design of learning materials. (JM Consulting, 2002) 

No mention was made, however, of the use of ASR to assist students who find it difficult 
or impossible to understand speech, with their receptive communication of 
speech in class or online. 

UK Disability Discrimination Legislation states that reasonable adjustments 
should be made to ensure that disabled students are not disadvantaged (HMSO, 
2001), and so it would appear reasonable to expect that adjustments should be made 
to ensure that multimedia materials including speech are accessible if a simple and 
inexpensive method to achieve this was available. Since providing a text transcript of 
a video does not necessarily provide equivalent information for a disabled learner, the 
government-funded Skills for Access website, 1 which describes itself as ‘the compre- 
hensive guide to creating accessible multimedia for e-learning’, currently advises that 
the most desirable accessibility solution is to: 

[...] provide the video with text captions for all spoken and other audio content [...] There 
is no ‘reasonable’ reason for not captioning video clips from a ‘widening access’ point of view. 

and that if resource limitations prohibit providing a reasonable alternative experience 
for those who cannot hear the video, the ‘reasonable adjustment’ argument: 

[...] begs the question: should you be using video clips at all? 

The Skills for Access website reports that it took a number of skilled workers many 
tens of hours to caption the video clips used on its site. As video and speech become 
more common components of online learning materials, the need for captioned 
multimedia with synchronised speech and text, as recommended by the Web Content 
Accessibility Guidelines (WAI, 1999), can be expected to increase, and so finding an 
affordable method of captioning will become more important to help support a 
‘reasonable adjustment’. 

This paper explores how using ASR can help provide a cost-effective way to assist 
and enable receptive communication, help ensure e-learning is accessible and 
enhance the quality of learning and teaching. 


Use of captions and transcription in education 

Deaf and hard of hearing people can find it difficult to follow speech through hearing 
alone, or to take notes while they are lip-reading or watching a sign-language interpreter. 
Although summarised notetaking and sign language interpreting are currently 
available, notetakers can only record a small fraction of what is being said while 
qualified sign language interpreters with a good understanding of the relevant higher 
education subject content are in very scarce supply (RNID, 2005): 

There will never be enough sign language interpreters to meet the needs of deaf and hard 
of hearing people, and those who work with them. 

Some deaf and hard of hearing students may also not have the necessary higher 
education subject-specific sign language skills. Students may consequently find it 
difficult to study in a higher education environment or to obtain the qualifications 
required to enter higher education. 

Stinson et al. (1988) reported that deaf or hard of hearing students at Rochester 
Institute of Technology who had good reading and writing proficiency preferred real- 
time verbatim text displays (i.e. similar to television subtitles/captions) to interpreting 
and/or notetaking. They have therefore developed the use of ASR re-voicing for the 
C-Print system in classrooms (Francis & Stinson, 2003), where the ‘notetaker’ 
repeats what the lecturer is saying into a special microphone ‘mask’ that reduces the 
sound heard by others: 

An extensive program of research has provided evidence that the C-Print system works 
effectively in public school and postsecondary educational settings. 

Robison and Jensema (1996) identified the value of speech recognition to over- 
come the difficulties that sign language interpreting had with foreign languages and 
specialist subject vocabulary, for which there are no signs: 

Fingerspelling words such as these slows down the interpreting process while potentially 
creating confusion if the interpreter or student is not familiar with the correct spelling. 

Downs et al. (2002) identify the potential of speech recognition in comparison to 
summary transcription services and students reporting programmes unable to keep 
up with the information flow in the classroom: 

The deaf or hard of hearing consumer is not aware, necessarily, whether or not s/he is 
getting the entirety of the message. 

Although UK government funding is available to deaf and hard of hearing students 
in higher education for interpreting or notetaking services, real-time captioning has 
not yet been used because of the shortage of trained stenographers wishing to work 
in universities. Since universities in the United Kingdom do not have direct respon- 
sibility for funding or providing interpreting or notetaking services, there would 
appear to be less incentive for them to investigate the use of ASR in classrooms as 
compared with universities in Canada, Australia and the United States. 

ASR offers the potential to provide automatic real-time verbatim captioning for 
deaf and hard of hearing students or for any user of systems when speech is not 
available, suitable or audible. Students, especially those whose first language is not 
English, may also find it easier to follow the captions and transcript than to follow the 
speech of a lecturer who may have a dialect or accent, or who may not have English as 
their first language. 

In lectures/classes students can spend much of their time and mental effort trying 
to take notes. This is a very difficult skill to master for any student, especially if the 
material is new and they are unsure of the key points, as it is difficult to simulta- 
neously listen to what the lecturer is saying, read what is on the screen, think care- 
fully about it and write concise and useful notes. The automatic provision of a live 
verbatim displayed transcript of what the teacher is saying, archived as accessible 
lecture notes, would therefore enable staff and students to concentrate on learning 
and teaching issues (e.g. students could be asked searching questions in the 
knowledge that they had the time to think) as well as benefiting students who find 
it difficult or impossible to take notes at the same time as listening, watching and 
thinking or those who are unable to attend the lecture (e.g. for mental or physical 
health reasons). Lecturers would also have the flexibility to stray from a pre-prepared 
‘script’, safe in the knowledge that their spontaneous communications will be 
‘captured’ permanently. 


Enhancing teaching and learning through reflection 

Poor oral presentation skills of teachers can affect all students, but are a particular 
disadvantage for disabled students and students whose first language is not 
English. Using ASR to capture all presentations in synchronised and transcribed form 
allows teachers to monitor and review what they have said and reflect on it to improve 
their teaching and the quality of their spoken communication. 


Access to preferred modality of communication 

Teachers may have preferred teaching styles involving the spoken or written word 
that may differ from learners’ preferred learning styles (e.g. teacher prefers spoken 
communication, while student prefers reading). Speech, text and images have 
communication qualities and strengths that may be appropriate for different content, 
tasks, learning styles and preferences. Speech can express feelings that are difficult to 
convey through text (e.g. presence, attitudes, interest, emotion and tone) and that 
cannot be reproduced through speech synthesis. Images can communicate informa- 
tion permanently and holistically, and simplify complex information and portray 
moods and relationships. Students can usually read much faster than a teacher speaks 
and so find it possible to switch between listening and reading. When a student 
becomes distracted or loses focus it is easy to miss or forget what has been said, 
whereas text reduces the memory demands of spoken language by providing a lasting 
written record that can be reread. 


Benefits of synchronised multimedia for learning and teaching 

Synchronising multimedia means that text, speech and images can be linked together 
by the stored timing of information, and this enables all the communication qualities 
and strengths of speech, text and images to be available as appropriate for different 
content, tasks, learning styles and preferences. Some students, for example, may find 
the more colloquial style of verbatim-transcribed text from spontaneous speech 
easier to follow than an academic written style. 


Creating synchronised multimedia 

Synchronised multimedia learning and teaching materials can offer many benefits 
for students but can be difficult to create, access, manage and exploit. Tools that 
synchronise pre-prepared text and corresponding audio files, either for the produc- 
tion of electronic books (e.g. Dolphin 2 ) based on the DAISY 3 specifications or for 
the captioning of multimedia (e.g. MAGpie 4 ) using, for example, the Synchronized 
Multimedia Integration Language, 5 are not normally suitable or cost-effective for 
use by teachers for the ‘everyday’ production of learning materials with accessible 
speech captions or transcriptions. This is because they depend on either a teacher 
reading a prepared script aloud, which can make a presentation less natural sound- 
ing and therefore less effective, or on obtaining a written transcript of the lecture, 
which is expensive and time consuming to produce. ASR can improve the usability 
and accessibility of e-learning through the cost-effective production of synchronised 
and captioned multimedia. 


Advantages of recorded speech compared with synthetic speech 

Synchronised speech and text can assist blind, visually impaired or dyslexic learn- 
ers to read and search text-based learning material more readily by augmenting 
unnatural synthetic speech with natural recorded real speech. Although speech 
synthesis can provide access to some text based materials for blind, visually 
impaired or dyslexic learners, it can be difficult and unpleasant to listen to for long 
periods and cannot match synchronised real recorded speech in conveying ‘peda- 
gogical presence’, attitudes, interest, emotion and tone, and communicating words 
in a foreign language and descriptions of pictures, equations, tables, diagrams, and 
so on. 


ASR feasibility trials 

Feasibility trials using existing commercially available ASR software to provide a 
real-time verbatim displayed transcript in lectures for deaf students, conducted in 1998 by the 
author in the United Kingdom (Wald, 2000) and by Saint Mary’s University, Nova 
Scotia, Canada, identified that standard speech recognition software (e.g. Dragon, 
ViaVoice [Scansoft, 6 2005]) was unsuitable as it required the dictation of punctua- 
tion, which does not occur naturally in spontaneous speech in lectures. The soft- 
ware also stored the speech synchronised with text in proprietary non-standard 
formats for editing purposes only, so that when the text was edited, speech and 
synchronisation could be lost. Without the dictation of punctuation, the ASR soft- 
ware produced a continuous unbroken stream of text that was very difficult to read 
and comprehend. Attempts to insert punctuation by hand in real time proved 
unsuccessful. The trials, however, showed that reasonable accuracy could be 
achieved by interested and committed lecturers who spoke very clearly and care- 
fully after extensively training the system to their voice by reading the training 
scripts and teaching the system any new vocabulary that was not already in the 
dictionary. 

Based on these feasibility trials the international Liberated Learning Collaboration 
was established by Saint Mary’s University, Nova Scotia, Canada in 1999, and since 
then the author has continued to work with IBM and Liberated Learning to investi- 
gate how ASR can make speech more accessible. 


Automatic formatting 

It is very difficult to punctuate transcribed spontaneous speech automatically in a 
useful way, as ASR systems can only recognise words and cannot understand the 
concepts being conveyed. Further investigations and trials demonstrated that it was 
possible to develop an ASR application that automatically formatted the transcrip- 
tion by breaking up the continuous stream of text based on the length of the pauses/ 
silences in the speech stream. Since people do not naturally spontaneously speak in 
complete sentences, attempts to insert conventional punctuation (e.g. a comma for 
a shorter pause and a full stop for a longer pause) in the same way as normal written 
text did not provide a very readable and comprehensible display of the speech. A 
more readable approach was achieved by providing a visual indication of pauses 
showing how the speaker grouped words together (e.g. one new line for a short 
pause and two for a long pause; it is, however, possible to select any symbols as 
pause markers). 
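
The pause-based formatting described above can be sketched as follows. This is only an illustration, not the code of the application discussed in this paper: the `TimedWord` structure, the function name and the threshold values (0.4 and 1.0 seconds) are assumptions chosen for the example, and it presumes that the recogniser reports a start and end time for each word.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    text: str
    start: float  # seconds from the start of the recording
    end: float


def format_by_pauses(words, short_pause=0.4, long_pause=1.0,
                     short_marker="\n", long_marker="\n\n"):
    """Join recognised words, inserting a marker wherever the speaker paused.

    The thresholds are illustrative assumptions, not measured values.
    """
    pieces = []
    previous = None
    for current in words:
        if previous is not None:
            gap = current.start - previous.end
            if gap >= long_pause:
                pieces.append(long_marker)   # long pause: two new lines by default
            elif gap >= short_pause:
                pieces.append(short_marker)  # short pause: one new line by default
            else:
                pieces.append(" ")           # no noticeable pause
        pieces.append(current.text)
        previous = current
    return "".join(pieces)


example = [
    TimedWord("today", 0.0, 0.4),
    TimedWord("we", 0.5, 0.6),
    TimedWord("look", 0.65, 0.9),
    TimedWord("at", 1.5, 1.6),
    TimedWord("pauses", 1.65, 2.1),
    TimedWord("next", 3.5, 3.8),
]
print(format_by_pauses(example))
```

Because the markers are parameters, any other symbols can be substituted for the new lines, for example `short_marker=" / "` and `long_marker=" // "`.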

Text created automatically from spontaneous speech using ASR usually has a 
more colloquial style than academic written text and, although students may prefer 
this, some teachers were concerned that it would make it appear that they 
had poor writing skills. Editors were therefore used to correct and punctuate the 
transcripts before making them available to students online after the lectures. 
However, lecturers’ concerns that a transcript of their spontaneous utterances will 
not look as good as carefully prepared and hand-crafted written notes can be met 
with the response that students at present can tape a lecture and then get it 
transcribed. Students are capable of understanding the different purposes and 
expectations of a verbatim transcript of spontaneous speech and pre-prepared writ- 
ten notes. 


The ‘Liberated Learning’ concept 

The potential of using ASR to provide automatic captioning of speech in higher 
education classrooms has now been demonstrated in ‘Liberated Learning’ classrooms 
in the United States, Canada and Australia (Bain et al., 2002; Wald, 2002; Leitch & 
MacMillan, 2003). Lecturers spend time developing their ASR voice profile by train- 
ing the ASR software to understand the way they speak. This involves speaking the 
enrolment scripts, adding new vocabulary not in the system’s dictionary and training 
the system to correct errors it has already made so that it does not make them in the 
future. Lecturers wear wireless microphones, providing the freedom to move around 
as they are talking, while the text is displayed in real time on a screen using a data 
projector so students can simultaneously see and hear the lecture as it is delivered. 
After the lecture the text is edited for errors and made available for students on the 
Internet. 

To make the Liberated Learning vision a reality, the prototype ASR application 
‘Lecturer’, developed in 2000 in collaboration with IBM, was superseded the follow- 
ing year by ‘IBM ViaScribe’. Both applications used the ViaVoice ‘engine’ and its 
corresponding training of voice and language models, and automatically provided 
text displayed in a window and stored for later reference synchronised with the 
speech. ViaScribe used a standard file format (e.g. SMIL) enabling synchronised 
audio and the corresponding text transcript and slides to be viewed on an Internet 
browser or through media players that support the SMIL 2.0 standard for accessible 
multimedia. 

ViaScribe 7 can automatically produce a synchronised captioned transcription of 
spontaneous speech using automatically triggered formatting from live lectures, or in 
the office, or even, using speaker-independent recognition, from recorded speech files 
on a website (Bain et al., 2005). 


Accuracy 

Studies to date have shown that it is difficult to obtain an accuracy of 
over 85% directly from the speech of all teachers in all higher education classroom 
environments (Leitch & MacMillan, 2003). It was also observed that lecturers’ 
ASR accuracy rates were lower in classes compared with those achieved in the 
office environment. This has also been noted elsewhere (Bennett et al., 2002). 
Informal investigations have suggested this might be because the rate of delivery 
varied more in a live classroom situation than in the office, resulting in the ends of 
words being run into the start of subsequent words. It is important to note that a 
recognition ‘error’ counted by the standardised statistical measurement of recognition 
accuracy does not necessarily affect readability or understanding (e.g. substitution 
of ‘the’ for ‘a’). It is difficult, however, to devise a standard measure for ASR 
accuracy that takes readability and comprehension into account. 
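
To make the point about error counting concrete, the sketch below computes a word-level accuracy figure from the minimum number of substituted, inserted and deleted words. This is the commonly used word error rate calculation; it is an assumption that the studies cited above measured accuracy in exactly this way, and the function names and example sentences are invented for illustration.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, insertions and deletions
    needed to turn the reference transcript into the ASR output."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # word deleted
                               current[j - 1] + 1,       # word inserted
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def word_accuracy(reference, hypothesis):
    """Accuracy as 1 minus (errors / number of reference words)."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_errors(reference, hypothesis) / ref_len


spoken = "the lecture covers a simple method"
harmless = "the lecture covers the simple method"  # 'the' for 'a'
harmful = "the lecture covers a sample method"     # 'sample' for 'simple'
print(word_accuracy(spoken, harmless), word_accuracy(spoken, harmful))
```

Both outputs contain one substitution in six words and therefore receive the same score, although only the second is likely to mislead a reader. The measure counts errors without weighing their effect on comprehension.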


Student and teacher feedback 

Detailed feedback (Leitch & MacMillan, 2003) from 44 students with a wide 
range of physical, sensory and cognitive disabilities and interviews with lecturers 
showed that both students and teachers generally liked the Liberated Learning 
concept and felt it improved teaching and learning as long as the text was reason- 
ably accurate (i.e. >85%). Many students developed strategies to cope with errors 
in the text and the majority of students used the text as an additional resource to 
verify and clarify what they heard (e.g. 87% of students surveyed reported 
watching the screen in class, 69% reported comparing their own notes with the 
digitised text and 63% reported retrieving the online notes). Typical comments 
were: 

It gives you something to compare your notes to and if you miss a class the notes are still 
accessible. 

It’s very helpful when the lecturer moves on while you’re still writing down a point as you 
can look at the screen. 


Coping with multiple speakers 

In Liberated Learning classrooms, lecturers repeated questions from students so 
they appeared transcribed on the screen. In interactive group sessions, in order that 
contributions, questions and comments from all speakers could be transcribed 
directly into text, each speaker would at present need to have their own separate 
personal ASR system trained to their voice. 

Current and planned developments 

Liberated Learning research and development has continued to try to improve 
usability and performance through training users, simplifying the interface and 
improving the display readability. In addition to continuing classroom trials in the 
United States, Canada and Australia, new trials will occur in the United Kingdom, China 
and Japan. Research and development also continues on developing the ASR appli- 
cation. MIT is a member of the Liberated Learning collaboration and is working to 
share information to assist the incorporation of speech recognition technology into 
MIT OpenCourseWare to help students find and review lecture materials (Hazen & 
Barzilay, 2005). 


Improving accuracy through editing and/or re-voicing 

Although it can be expected that developments in ASR will continue to improve accu- 
racy rates, the use of a human intermediary to improve accuracy through re-voicing 
and/or correcting mistakes in real time as they are made by the ASR software could, 
where necessary, help compensate for some of ASR’s current limitations. 

It is possible to edit errors in the synchronised speech and text to insert, delete or 
amend the text with the timings being automatically adjusted. For example, an 
‘editor’ correcting 15 words per minute would improve the accuracy of the 
transcribed text from 80% to 90% for a speaker talking at 150 words per minute. Not 
all errors are equally important, and so the editor can use their initiative to prioritise 
those that most affect readability and understanding. 
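
The arithmetic behind this example can be checked with a short calculation. The function below is a simplified sketch with an invented name; it assumes that every correction removes exactly one real error and that the editor introduces no new ones.

```python
def accuracy_after_editing(speech_wpm, asr_accuracy, corrections_per_minute):
    """Estimate transcript accuracy once an editor has corrected some errors.

    Assumes each correction fixes one genuine error (a simplification).
    """
    if speech_wpm <= 0:
        raise ValueError("speech rate must be positive")
    errors_per_minute = speech_wpm * (1.0 - asr_accuracy)
    remaining = max(0.0, errors_per_minute - corrections_per_minute)
    return 1.0 - remaining / speech_wpm


# 150 words per minute at 80% accuracy gives 30 errors per minute;
# correcting 15 of them leaves 15 errors in 150 words.
print(f"{accuracy_after_editing(150, 0.80, 15):.0%}")  # prints 90%
```

The same calculation shows that the editor's workload depends on the speaking rate: at a faster rate the same number of corrections per minute yields a smaller gain.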

An experienced trained ‘re-voicer’ using ASR by repeating very carefully and 
clearly what has been said can improve accuracy over the original speaker using ASR 
where the original speech is not of sufficient volume/quality or when the system is not 
trained (e.g. telephone, Internet, television, indistinct speaker, multiple speakers, 
meetings, panels, audience questions). Re-voiced ASR is sometimes used for live 
television subtitling in the United Kingdom (Lambourne et al., 2004) and in class- 
rooms and courtrooms in the United States (Francis & Stinson, 2003) using a mask 
to reduce background noise and disturbance to others. 

While one person acting as both the re-voicer and editor could attempt to create 
real-time edited re-voiced text, this would be more problematic if a lecturer 
attempted to edit ASR errors while they were giving their lecture. However, a person 
editing their own ASR errors to increase accuracy might be more acceptable when 
using ASR to communicate one-to-one with a deaf person. 


Automatic Speech Recognition and receptive communication 17 


Improving usability and performance 

Current unrestricted vocabulary ASR systems are normally speaker dependent and 
so require the speaker to train the system to the way they speak, any special vocab- 
ulary they use and the words they most commonly employ when writing. This 
normally involves initially reading aloud from a provided training script, providing 
written documents to analyse, and then continuing to improve accuracy by improv- 
ing the voice and language models by correcting existing words that are not recogn- 
ised and adding any new vocabulary not in the dictionary. Current research 
includes a new integrated speech recognition engine (‘Lecturer’ and ‘ViaScribe’ 
required the ViaVoice ASR engine) and providing ‘pre-trained’ voice models (the 
most probable speech sounds corresponding to the acoustic waveform) and 
language models (the most probable words spoken corresponding to the phonetic 
speech sounds) from samples of speech, so the user does not need to spend the 
time reading training scripts to improve the voice or language models. This should 
also help ensure better accuracy for a speaker’s specialist subject vocabularies and 
also spoken spontaneous speech structures, which will differ from their more formal 
written structures. Speaker-independent systems currently tend to produce lower 
accuracy than trained models but systems can improve accuracy with exposure to 
the speaker’s voice. 


Personalised displays 

Liberated Learning’s research has shown that while projecting the text onto a large 
screen in the classroom has been used successfully, it is clear that in many situations 
an individual personalised and customisable display would be preferable or essential. 
An application is therefore being developed to provide users with their own personal 
display on their own web-enabled wireless systems (e.g. computers, PDAs, mobile 
phones, etc.) customised to their preferences (e.g. font, size, colour, text formatting 
and scrolling). 


Highlighting, annotation and manipulation of synchronised material 

Since it would take students a long time to read through a verbatim transcript after a 
lecture and summarise it for future use, it would be valuable for students to be able 
to create an annotated summary for themselves in real time through highlighting, 
selecting and saving key sections of the transcribed text and adding their own words 
time linked with the synchronised transcript. 
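
One possible way to represent such a time-linked summary is sketched below. The `Entry` structure and `build_summary` function are hypothetical and do not describe any existing application; the sketch simply assumes that each highlighted passage and each student note carries the time, in seconds, at which it occurred in the lecture.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    time: float   # seconds into the lecture
    text: str
    source: str   # "transcript" for a highlighted passage, "student" for a note


def build_summary(highlights, notes):
    """Merge highlighted transcript passages and the student's own notes
    into a single summary ordered by lecture time."""
    entries = [Entry(t, text, "transcript") for t, text in highlights]
    entries += [Entry(t, text, "student") for t, text in notes]
    return sorted(entries, key=lambda entry: entry.time)


summary = build_summary(
    highlights=[(120.0, "ASR needs training"), (30.0, "captions help notetaking")],
    notes=[(45.0, "check the reading list")],
)
for entry in summary:
    print(f"{entry.time:6.1f}s  [{entry.source}] {entry.text}")
```

Because every entry keeps its time, selecting an item in the summary could return the student to the corresponding point in the synchronised speech.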


Managing, searching and indexing multimedia 

It is difficult to search multimedia materials (e.g. speech, video, PowerPoint files), 
and using ASR to synchronise speech with transcribed text files can assist learners and 
teachers to manipulate, index, bookmark, manage and search for online digital 
multimedia resources that include speech by means of the synchronised text. Standard 
synchronised multimedia streams do not currently offer a simple way to achieve 
this. 
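
As a minimal sketch of searching speech by means of synchronised text, the example below looks up a word in a timed transcript and returns the points in the recording where it was spoken. The `Segment` structure and the function names are assumptions made for illustration, and the transcript is invented.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the recording
    text: str     # transcribed words beginning at that time


def search_transcript(segments, query):
    """Return the start times of all segments whose text contains the query."""
    needle = query.casefold()
    return [seg.start for seg in segments if needle in seg.text.casefold()]


def as_timestamp(seconds):
    """Format a time in seconds as mm:ss for display."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


lecture = [
    Segment(0.0, "Welcome to the lecture"),
    Segment(75.5, "Speech recognition converts audio to text"),
    Segment(190.0, "Errors in recognition can be edited"),
]
for start in search_transcript(lecture, "recognition"):
    print(as_timestamp(start))
```

A media player could then begin playback at each returned time, so that the learner hears the original speech rather than reading the transcript alone.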


Conclusion 

It would appear to be reasonable to expect educational material produced by staff and 
students to be accessible to disabled students whenever possible, and audiovisual 
material in particular can benefit from captioning. Screen readers using speech 
synthesis can provide access to many materials but it will also sometimes be helpful 
to provide real synchronised speech. ASR enables academic staff to take a proactive 
rather than a reactive approach to teaching students with disabilities by providing 
practical, economic methods to make their teaching accessible and assist learners to 
manage and search online digital multimedia resources. This can improve the quality 
of education for all students because the automatic provision of accessible synchro- 
nised lecture notes enables students to concentrate on learning and enables teachers 
to monitor and review what they said and reflect on it to improve their teaching. 

The only ASR application that is currently being used in classrooms to provide a 
real-time synchronised and editable transcription would appear to be IBM ViaScribe; 
therefore, to further research and develop the use of ASR, applications need to 
continue to be developed for use by researchers, staff and students. For example, 
ViaScribe automatically produces a phonetic transcription, and this could be used for 
‘phonetic searching’ (Clements et al., 2002) without the need to correct ASR errors 
in the transcript. Phonetic searching is faster than searching the original speech and 
can also help overcome ASR ‘out of vocabulary’ errors that occur when words spoken 
are not known to the ASR system, as it searches for words based on their phonetic 
sounds not their spelling. 


Notes 

1. Skills for Access: http://www.skillsforaccess.org.uk 

2. Dolphin: http://www.dolphinaudiopublishing.com/products/EasePublisher/index.htm 

3. DAISY: http://www.daisy.org 

4. MAGpie: http://ncam.wgbh.org/webaccess/magpie/ 

5. SMIL: http://www.w3.org/AudioVideo/ 

6. Scansoft: http://www.nuance.com/ 

7. ViaScribe: http://www-306.ibm.com/able/solution_offerings/ViaScribe.html 